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ABSTRACT 

The analysis of 2000 bacterial genomes revealed 
that they all, without a single exception, encode one 
or more DNA polymerase III oc-subunit (Polllla) 
homologs. Classified into C-family of DNA polymer- 
ases they come in two major forms, PolC and DnaE, 
related by ancient duplication. While PolC repre- 
sents an evolutionary compact group, DnaE can be 
further subdivided into at least three groups (DnaE1- 
3). We performed an extensive analysis of various 
sequence, structure and surface properties of all 
four polymerase groups. Our analysis suggests a 
specific evolutionary pathway leading to PolC and 
DnaE from the last common ancestor and reveals 
important differences between extant polymerase 
groups. Among them, DnaE1 and PolC show the 
highest conservation of the analyzed properties. 
DnaE3 polymerases apparently represent an 
'impaired' version of DnaE1. Nonessential DnaE2 
polymerases, typical for oxygen-using bacteria 
with large GC-rich genomes, have a number of 
features in common with DnaE3 polymerases. The 
analysis of polymerase distribution in genomes 
revealed three major combinations: DnaE1 either 
alone or accompanied by one or more DnaE2s, 
PolC + DnaE3 and PolC + DnaE1. The first two 
combinations are present in Escherichia coli and 
Bacillus subtilis, respectively. The third one 
(PolC + DnaE1), found in Clostridia, represents a 
novel, so far experimentally uncharacterized, set. 

INTRODUCTION 

DNA polymerase III is a tripartite protein machine 
responsible for replication of bacterial genome (1-5). 



It consists of a DNA polymerase, its processivity factor 
p-clamp and a clamp loader complex. The actual DNA 
synthesis is performed by the polymerase III a-subunit 
(Polllla), classified into the C-family of DNA polymer- 
ases (6). Surprisingly, bacterial Polllla subunits are both 
structurally and evolutionary distinct from eukaryotic and 
archaeal replicative DNA polymerases (7,8) that belong to 
the B-family. Instead, the Polllla catalytic domain is dis- 
tantly related to the X-family of DNA polymerases (7,8), 
exemplified by eukaryotic Poip, a polymerase acting in 
DNA excision repair (9,10). It should be noted that this 
unexpected relationship could not be detected by protein 
sequence comparison and only became apparent in the 
context of 3D structures (7,8). Although polymerases of 
C and X families are not globally similar, a strong case for 
their common evolutionary origin could be made based on 
the observation that they share a common fold of corres- 
ponding 'palm' domains and bind DNA in the same 
manner (11). In contrast, 'palm' domains of DNA poly- 
merases belonging to A, B and Y families have entirely 
different fold. Taken together, these findings lend add- 
itional support for the hypothesis that bacterial replicative 
polymerases (C-family) on one hand and archaeal/eukary- 
otic replicative polymerases (B-family) on the other hand 
have evolved as components of two independent DNA 
replication systems (12). Another interesting observation 
is that C-family polymerases are essentially confined to the 
bacteria kingdom. Only a handful of Polllla homologs 
have been detected in bacteriophages, which predomin- 
antly use B-family (and to lesser extent A-family) DNA 
polymerases (13,14). One of the explanations for the 
scarcity of Polllla homologs even in bacteria-infecting 
viruses is that the C-family is evolutionary 'young' 
compared with the B-family (13). Owing to their relatively 
late emergence, C-family DNA polymerases might have 
failed to make a significant imprint in the B-family- 
dominated viral landscape (13), and a few instances of 
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C-family members in bacteriophages might be the result of 
lateral gene transfer events from bacteria (15). 

PolIIIa subunits come in two major forms, DnaE (7,8) 
and PolC (16). A typical example of DnaE is PolIIIa of 
the extensively studied model organism, Gram-negative 
bacterium Escherichia coli. PolC is present in low-GC 
Gram-positive bacteria such as Bacillus subtilis. The two 
different PolIIIa forms are thought to be the result of an 
ancient gene duplication event predating the radiation of 
Gram-positive and Gram-negative bacteria (17). The two 
forms undoubtedly share common evolutionary origin, yet 
they differ by the exact composition and the arrangement 
of structural domains (7,8,16). Both have the polymerase 
and histidinol phosphatase (PHP) domain, the polymerase 
core consisting of 'palm 1 , 'thumb 1 and 'fingers', and the 
tandem helix-hairpin-helix (HhH) 2 motif followed by the 
(3-clamp binding motif, all arranged in the same order 
(Figure 1). When it comes to differences, the oligonucleo- 
tide binding (OB) domain present in both forms is 
embedded in the opposite regions of the polypeptide 
chain. In DnaE-type PolIIIa, it is C-terminal to the (3- 
clamp binding motif, while in PolC, it is N-terminal to 
the PHP domain. In addition, DnaE and PolC possess 
structural domains unique to each form. DnaE has a 
small structural domain at the extreme C-terminus. In 
contrast, PolC has an unrelated N-terminal domain 
(NTD) predicted to have two type II KH-like subdomains 
(18). Moreover, PolC has an integral proofreading 3'-5' 
exonuclease domain inserted into the PHP domain, while 
the DnaE proofreading exonuclease activity is provided 
by e, a separate subunit. 



If the number of distinct PolIIIa subunits and their 
role in a bacterial cell are considered, there also are 
notable differences. The widely studied E. coli encodes a 
sole DnaE-type PolIIIa subunit, which performs DNA 
synthesis of both leading and lagging strands (1,19,20). 
However, this is not a universal situation in the bacterial 
world. For example, low-GC Gram-positive bacteria were 
found to have both PolC and DnaE (17). Experiments 
with B. subtilis and some other Gram-positive bacteria 
showed that both types of PolIIIa subunits are essential 
(21-23). Initially, it was thought that PolC and DnaE are 
leading and lagging strand polymerases, respectively (21). 
However, more recently, in vitro experiments with the 
reconstituted B. subtilis replisome (24) revealed a different 
picture of their division of labor. It turned out that DnaE 
makes an initial extension of the RNA primer on both 
strands and then PolC takes over for rapid synthesis of 
long stretches of DNA (24). In this regard, B. subtilis 
DnaE is reminiscent of eukaryotic Pol a, which extends 
the RNA primer and then makes way for a processive 
replicase (25). Some bacteria have a second copy of 
DnaE, usually referred to as DnaE2. So far, genetic 
studies targeting dnaE2, all without a single exception, 
identified it as a nonessential gene (26-32), indicating 
that DnaE2 is not required for chromosomal DNA repli- 
cation. Instead, DnaE2 has been associated with DNA 
damage-inducible error-prone translesion DNA synthesis 
(TLS) (26-28,31,32). In genomes, dnaE2 is typically found 
as part of LexA-regulated contiguous or split multigene 
cassette, which includes two other genes, imuA/imuA' and 
imuB (27,33,34). The two genes encode catalytically 
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Figure 1. Structural organization of DnaE and PolC forms of C-family DNA polymerases. Crystal structures of T. aquaticus DnaE (left, PDB ID: 
3E0D) and G. kaustophilus PolC (right, PDB ID: 3F2B) complexes with the DNA displayed in same orientation. Protein structures are shown as 
solvent accessible surfaces with different structural modules shown in different colors. The missing NTD and the exonuclease domain (Exo) in PolC 
structure are represented correspondingly by a pair of ellipses (purple) and a filled circle (brown). Linear domain organization for both polymerase 
forms is indicated at the bottom. Domain labels: PHP, the polymerase and histidinol phosphatase domain; Pol3, the polymerase core consisting 
of 'palm', 'thumb' and 'fingers'; HhH, the tandem helix-hairpin-helix motif; OB, oligonucleotide binding domain; CTD, C-terminal domain; NTD, 
N-terminal domain consisting of two subdomains; Exo, an integral proofreading 3'-5' exonuclease domain. Colors for individual domains correspond 
to those in structural representation. The P-clamp binding motifs are indicated by black arrows. 
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inactive homologs of RecA and Y-family DNA polymer- 
ase, respectively (29,35). In the most detailed study to 
date, it was shown that in Mycobacterium tuberculosis, 
both genes along with dnaE2 are necessary for induced 
mutagenesis, but the error-prone TLS is directly linked 
to DnaE2 (35). Nonetheless, the role of DnaE2 as an 
error-prone TLS polymerase might not be general. For 
example, the dnaE2 disruption in two Pseudomonas 
species {Pseudomonas aeruginosa and Pseudomonas 
putida) produces opposite effects (28,29). Moreover, a 
recent study was unable to associate any phenotype with 
the Streptomyces coelicolor dnaE2 mutant (30). 
Streptomyces coelicolor dnaE2 was found to be SOS-indu- 
cible, but it was dispensable for DNA replication, linear 
chromosome end patching, ultraviolet resistance or muta- 
genesis. Whether the observed dnaE2 phenotypic differ- 
ences reflect intrinsic DnaE2 properties or the 
differentially controlled access of DnaE2 to the sites of 
DNA synthesis remains unclear. 

Except for a handful of bacterial species, C-family DNA 
polymerases have not been studied in detail experimen- 
tally. On the other hand, the increasing availability of 
bacterial genome sequences provides a possibility to 
explore the diversity and distribution of PolIIIa subunits 
in bacteria using computational methods. An earlier 
survey of annotated PolIIIa subunits within 159 fully 
sequenced bacterial genomes (36) partitioned C-family 
DNA polymerases into four major groups, namely, PolC 
and three DnaE groups (DnaEl, DnaE2 and DnaE3). The 
survey has also found that different types of PolIIIa 
subunits have different preferred combinations within 
bacterial genomes suggesting different degree of versatility 
and mutual compatibility for individual groups of 
C-family polymerases (36). 

In this study, we performed a comprehensive analysis of 
C-family DNA polymerases (putative PolIIIa subunits and 
their homologs) identified in a much larger sample (close to 
2000) of complete bacterial genomes. We took advantage 
of the available 3D structures of PolIIIa representatives 
and surveyed various sequence, structure and surface 
properties as well as their differences within and between 
distinct groups of C-family polymerases. Among other 
things, the results enabled us to suggest a specific evolu- 
tionary pathway leading to the emergence of DnaE and 
PolC from the common ancestor. We also surveyed the 
combinations of PolIIIa homologs found in genomes in 
an attempt to get more clues about functional properties 
of different polymerase groups and a deeper insight into 
the evolution of bacterial replication systems. As a result, 
in addition to two typical replication systems represented 
correspondingly by E. coli and B. subtilis, our analyses 
suggested the existence of a third, so far uncharacterized, 
replication system in Clostridia. 

MATERIALS AND METHODS 

Bacterial genomes and protein sequence data 

Annotated complete bacterial genomes (Supplementary 
Table SI) were obtained from NCBI (ftp://ftp.ncbi.nlm. 
nih.gov/genomes/Bacteria/). The associated data on 



bacterial taxonomic classification, various physiological 
properties, metabolic features and habitats were obtained 
from the Integrated Microbial Genomes database (http:// 
img.jgi.doe.gov/). C-family DNA polymerases (DNA 
PolIIIas and their homologs) were identified by performing 
protein sequence searches with PSI-BLAST (37) against 
the protein database derived from the collected bacterial 
genomes. PSI-BLAST searches were run until convergence 
(E-value = le-03 inclusion threshold) using conserved 
polymerase regions (corresponding to the 324-788 region 
of the E. coli PolIII a-subunit) of C-family representatives 
as search probes. To make sure that no unannotated se- 
quences were missed, representatives of C-family were also 
used to search the collection of genomic sequences using 
TBLASTN (E-value = le-03 significance threshold). 
Results of both types of searches were combined. If a poly- 
merase sequence contained an intein, it was excised before 
further analysis. A number of cyanobacterial polymerase 
sequences are split by an intein (38). Such sequences were 
joined before the intein removal. A small number of se- 
quences were found fragmented, usually due to frameshifts 
that at least in some cases could be the result of sequencing/ 
assembling errors. Therefore, if closely related sequences 
were found intact in other genomes, the fragmented se- 
quences were also reconstructed to avoid false-negative 
results in polymerase distribution studies. The N-termini 
of some sequences, missing as a result of wrong selection of 
the translation start site, were corrected based on compari- 
son with close homologs. Any remaining sequence frag- 
ments without an intact polymerase active site were 
removed from further analysis. 

A nonredundant set of genomes was constructed as 
follows. If any two genomes encoded the same number of 
C-family polymerases and corresponding polymerase se- 
quences in both of them were >90% identical [determined 
by clustering with CD-HIT (39)], only one of the two 
genomes was included in the nonredundant set. If the 
number of C-family polymerases encoded in the two 
genomes was different, both genomes were included inde- 
pendently of sequence similarity. As a result, groups of 
closely related genomes (typically different strains of the 
same species) were represented by a single genome, signifi- 
cantly reducing the redundancy of genomic data. 

Multiple sequence alignments and the analysis of sequence 
features 

Multiple sequence alignments in all analyses were con- 
structed with MAFFT (40) using the accuracy-oriented 
mode (L-INS-i). Predictions of protein secondary struc- 
ture and disordered regions were carried out using, 
respectively, PSIPRED (41) and DISOPRED (42). 
Theoretical isoelectric points (pis) for protein sequences 
were calculated using the 'Compute pI/Mw' tool on the 
ExPASy server (43). 

Phylogenetic analysis 

Initially, full-length sequences of the nonredundant set 
were aligned. The highly conserved PHP-Pol3-(HhH) 2 
region was then excised and the corresponding sequence 
regions were realigned. The alignment was then reduced 
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by removing positions with at least 50% gaps to increase 
the signal-to-noise ratio of the subsequent phylogenetic 
analysis. A maximum-likelihood phylogenetic tree was 
constructed using RAxML program (44). The tree was 
constructed using Le and Gascuel model of amino acid 
substitution (45) with the use of the r model of rate het- 
erogeneity. This model was selected as best fitting the 
analyzed sequences using the program ProtTest (46), run 
with standard parameters. The best tree was selected from 
160 distinct tree inferences. The reliability of subtrees was 
inferred using the widely accepted bootstrap method (47) 
as implemented in RAxML. The bootstrap support values 
were obtained after 1000 generations. Tree analysis and 
visualization were carried out using Dendroscope (48) 
and iTOL (49). 

Identification of conserved structural domains 

Conserved structural-functional domains in polymerase 
sequences were identified using three different approaches 
designed to maximize the sensitivity of domain detection. 

First, E. coli DnaE and B. subtilis PolC sequence regions 
corresponding to structural-functional domains defined 
based on DnaE and PolC crystal structures (Figure 1) 
were used as queries for PSI-BLAST searches (up to five 
iterations; the inclusion threshold E- value = le-02) 
against polymerase sequences. Significant sequence 
matches (E-value< le-03, at least 40% of query sequence 
aligned, 10% or higher sequence identity) were retained. 

Next, for each of the PolIIIa domains, the sequences 
identified in the PSI-BLAST searches and filtered to 
90% sequence identity were used to construct domain- 
specific Hidden Markov Models (HMMs). HMMs con- 
structed from MAFFT alignments using the HMMer 
software suite (50) were appended to the full HMM col- 
lection of PFAM (51) domains. Every polymerase 
sequence was then scanned against the 'extended' PFAM 
database with hmmscan (HMMer suite). Matches with 
E-value< le-03 were considered significant. 

The two above approaches were sufficiently sensitive to 
identify Pol3, PHP, (HhH) 2 and Exo domains. At the 
same time, for a number of polymerase sequences, no 
conserved domains with statistically significant values 
were identified in the N- and C-terminal regions, where 
at least OB domain and CTD (in the case of DnaE) or 
NTD and OB domain (for PolC) could be expected. To 
further test the domain composition of these terminal 
regions, HHsearch (52), a more sensitive homology detec- 
tion method based on the HMM-HMM comparison, was 
used. To this end, domain-specific HMMs were con- 
structed from the updated multiple sequence alignments 
for NTD, CTD and OB domains using the HHsearch suite 
instead of HMMer. The resulting HMMs were then 
appended to the HHsearch-specific PFAM domain 
database (53). For every query, an HMM was constructed 
using three iterations (E-value = le-03) of either HHblits 
(54) against nr20 (the NCBI nonredundant protein 
sequence database, filtered to the maximum of 20% 
sequence identity) or PSI-BLAST against nr80. Query 
HMM was then used to search the 'extended' PFAM 
domain HMM database. Iterative methods (e.g. 



PSI-BLAST) often tend to overextend alignments into 
neighboring nonhomologous domains producing false- 
positive matches (55). To avoid including more than a 
single domain into the query alignment and subsequently 
into the corresponding query HMM, the following pro- 
cedure was used. For unassigned sequence regions, a short 
fragment of terminal 50 residues was initially used as a 
query. If none of the conserved domains were detected, 
additional searches were performed by gradually extend- 
ing the query region. After each extension, the results were 
inspected for the presence of false positives (already 
assigned neighboring domains, e.g. OB domain). The 
domain assignment was considered reliable (true 
positive) if its HHsearch probability was >90% for this 
particular domain and at the same time at most 30% 
probability for unrelated domains. 

Analysis of functional motifs 

For full-length sequences of each polymerase group, 
separate multiple sequence alignments were generated. 
The alignments were inspected visually and adjusted 
manually if necessary. For each analyzed motif, relative 
positions from multiple sequence alignments for each 
group were extracted. In cases where multiple noncon- 
secutive positions were analyzed (e.g. the metal binding 
site of the PHP domain), concatenated alignments for 
each group were made. The WebLogo (56) representation 
of resulting alignments was used to visualize the distribu- 
tion of residues (or deletions, represented by gray squares) 
in each position. 

Analysis of residue conservation and 3D structure surface 
properties 

All analyzed polymerases with no known 3D structure 
were modeled using homology modeling approach. 
Structures of Thermits aquaticus DnaEl (PDB ID: 3E0D) 
and Geobacillus kaustophilus PolC (PDB ID: 3F2B) with 
bound DNA were used as templates to generate models for 
DnaE and PolC polymerases accordingly. Sequence-struc- 
ture alignments were generated using HHsearch. The 3D 
structural models were constructed with Modeller (57) and 
then evaluated using Prosa2003 (58) and visual inspection. 
In the case of visible flaws, models were iteratively refined 
(59). The analysis of surface residue conservation was per- 
formed using the ConSurf (60) server supplied with locally 
constructed multiple sequence alignments for each of the 
analyzed groups. Surface electrostatic properties of all 
structures were computed using the APBS server (61). 
Before electrostatics calculation, the structures were 
prepared using the PDB2PQR server (62) with the 
PARSE force field. Visualization and analysis of 3D struc- 
tures was performed with UCSF Chimera (63). 

RESULTS 

Initially, we identified all putative C-family DNA 
polymerases (PolIIIa subunits and their homologs) 
in 1877 completely sequenced bacterial genomes. Once 
the polymerases (2956 in total) were compiled, we 
selected a representative nonredundant set of bacterial 
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genomes as described in 'Materials and Methods' section. 
This mostly removed closely related strains of the 
same species that otherwise might have strongly biased 
the data. All the subsequent analyses were performed 
using the representative set of 945 bacterial genomes 
coding for 1590 putative C-family DNA polymerases 
(the redundancy of both genomes and polymerases was 
reduced approximately twice). Detailed information 
about each of the representative polymerases is presented 
in Supplementary Table SI. 

Distinct polymerase groups 

The distinction between the two major forms (DnaE and 
PolC) of C-family polymerases has been noticed some 



time ago (17). We used phylogenetic analysis to reveal 
evolutionary partitioning of the C-family at a higher reso- 
lution. Based on the available crystal structures of DnaE 
(7,8) and PolC (16) representatives, we first defined a 
conserved region that is shared by both forms (Figure 1) 
and includes PHP, Pol3 and (HhH) 2 domains. We then 
used this region to analyze the evolutionary relationship 
between C-family polymerases. Phylogenetic analysis 
revealed that the two major forms are well-separated 
(100% bootstrap support) and yet show a different 
degree of diversity (Figure 2). PolC polymerases, typified 
by one of the two essential B. subtilis PolIIIa subunits 
(PolC), represent a single, evolutionary compact, group. 
In contrast, DnaE-type polymerases are significantly more 




Figure 2. Phylogenetic tree of C-family polymerases. Tree colors correspond to four distinct polymerase groups (labeled). The tree is arbitrarily 
rooted at PolC. Bootstrap values, relevant to polymerase group separation, are shown (numbers). The zoomed area shows the DnaE3 branching 
area. Bacterial taxonomy is represented by both a colored strip (inner) and text. Minor (less than five bacteria in the set) taxonomic groups are 
shown only in colors. PolC-containing taxa are shown in bold. For clarity, Clostridia and Negativicutes classes (taxonomically closely related) of 
phylum Firmicutes are merged. 
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diverse. The most distinct among DnaE-type polymer- 
ases is the DnaE2 group (90% bootstrap support). One 
of the best characterized members of this group is 
M. tuberculosis DnaE2, a nonessential error-prone DNA 
polymerase (26,35). Overall, the remaining phylogenetic 
tree is poorly resolved. Nevertheless, a group that 
includes B. subtilis DnaE, the second essential PolIIIa 
subunit in addition to PolC, stands out. Following the 
previously introduced nomenclature (36), we labeled this 
group as DnaE3. The consideration of DnaE3 polymer- 
ases as a distinct group is supported by the high bootstrap 
value (91%). However, there is unavoidable ambiguity 
in putting the exact separation line between the DnaE3 
group and the remaining sequences. Therefore, we chose 
to assign all the sequences that clustered together more 
often than not (bootstrap support >50%, see the 
zoomed section in Figure 2) to the DnaE3 group. The 
remaining sequences were assigned to the DnaEl group. 
A well-characterized representative of the DnaEl group is 
the E. coli PolIIIa subunit, which is the sole high fidelity 
replicative C-family DNA polymerase in the cell. Notably, 
DnaEl is a large and diverse group. Therefore, it might be 
argued that DnaE3 may be considered as just one of many 
subgroups. However, the separation of the DnaE3 group 
is also supported by the analysis of a number of different 
features presented in the sections below. 

To maximize phylogenetic tree resolution, a minimal 
number of strongly diverged sequences were removed 
from the analysis. All of the omitted sequences, except 
three, were present as additional copies in respective 



genomes, suggesting that they are not the primary repli- 
cative polymerases. They were labeled DnaEX, while the 
three essential polymerases were assigned to the specific 
polymerase group based on the sequence similarity alone. 

Genomic distribution 

Having defined distinct polymerase groups, we next 
analyzed which polymerase combinations are observed 
in individual genomes. Are there preferred, avoided or 
even incompatible (not observed) combinations? Based 
on the results of the analysis presented in Table 1, 
several observations can be made. Importantly, we did 
not find a single genome that would entirely lack 
C-family polymerases. This finding implies that bacteria 
universally use C-family polymerases for the replication of 
their genome, and that there are no alternative nonhom- 
ologous functional solutions. The majority of bacteria 
have two, three or even four putative C-family polymer- 
ases. However, a large fraction (41%) of genomes encode 
a single polymerase of the DnaEl type (like in E. coli). 
Consistent with an earlier study (36), we find that DnaEl 
is not only the most widely distributed, but also the only 
one that may exist in a genome either alone or in combin- 
ation with polymerases from other groups. Members of 
other groups (PolC, DnaE3 and DnaE2) are always 
accompanied by at least one representative of a different 
group. PolC always co-occurs with either DnaE3 or 
DnaEl. Unlike widely distributed DnaEl, PolC is 
confined to several bacterial phyla, namely Firmicutes, 



Table 1. Combinations of C-family DNA polymerases in 945 bacterial genomes 



Genome 


Fraction of 


Number of 


Polymerase 


Taxonomic spread 


count 


genomes (%) 


polymerases 


combination 




386 


40.9 


1 


DnaEl 


All except Firmicutes [Bacilli, Erysipelotrychi (1)], Tenericutes, Thermotogae, 
Fusobacteria, Gemmatimonadetes, Ignavibacteria 


282 


29.8 


2 


DnaEl +DnaE2 


Acidobacteria, Actinobacteria, Bacteroidetes, Chloroflexi, Planctomycetes, 
Proteobacteria, Spirochaetes, Synergistetes, Thermodesulfobacteria, 
Verrumicrobia, Gemmatimonadetes (1), Ignavibacteria (1), Nitrospirae (1), 
Thermi (1), Firmicutes [Clostridia (1)] 


40 


4.2 


3 


DnaEl +2xDnaE2 


Actinobacteria, Bacteroidetes, Proteobactcria, Spirochaetes, Acidobacteria (1) 


8 


0.9 


4 


DnaEl + 3xDnaE2 


Bacteroidetes, Proteobacteria 


3 


0.3 


2 


2xDnaEl 


Actinobacteria, Deferribacteres (1) 


3 


0.3 


3 


2xDnaEl +DnaE2 


Actinobacteria, Proteobacteria (1) 


3 


0.3 


3 


DnaEl + DnaE2 + DnaEX 


Actinobacteria, Proteobacteria (1) 


2 


0.2 


2 


DnaEl + DnaEX 


Proteobacteria 


66 


7.0 


2 


PolC + DnaEl 


Firmicutes (Clostridia, Negativicutes), Fusobacteria 


3 


0.3 


3 


PolC + DnaE l + DnaE2 


Firmicutes (Clostridia) 


6 


0.6 


3 


PolC + DnaE 1 + DnaEX 


Firmicutes (Clostridia, Negativicutes) 


1 


0.1 


3 


PolC + 2xDnaEl 


Firmicutes [Clostridia (1)] 


1 


0.1 


4 


PolC + DnaE 1 + 2xDnaEX 


Firmicutes [Negativicutes (1)] 


1 


0.1 


4 


2xPolC + DnaEl + DnaEX 


Firmicutes [Clostridia (1)] 


130 


13.8 


2 


PolC + DnaE3 


Firmicutes [Bacilli, Erysipelotrychi (l) a ], Tenericutes, Thermotogae 


3 


0.3 


3 


PolC + DnaE3 + DnaE2 


Firmicutes [Bacilli, Clostridia(l) h ] 


5 


0.5 


3 


PolC + DnaE3 + DnaEX 


Firmicutes (Bacilli) 


2 


0.2 


3 


2xPolC + DnaE3 


Tenericutes (1), Thermotogae(l) 



Polymerase combinations observed in > 1 % of the analyzed bacterial genomes are emphasized by bold font. Relevant bacterial phyla (and classes of 
Firmicutes in parentheses) are listed in the last column [single occurrences are marked with '(I)']- 

^Erysipelotrychi bacteria have low rRNA similarity to other Firmicutes, some phenotypic traits differ considerably. Previously these bacteria were 
classified with bacteria of current phylum Tenericutes (64). 

b Sulfobacillus acidophilus is a member of sulfobacilli, which were only tentatively assigned as a family of Clostridia (64). According to genomic 
distribution and domain architectures of PolIIIocs, 5. acidophilus is related to Kyrpidia tusciae and Alicyclobacillus acidocaldarius, both belonging to 
the Bacilli class. Previously, these three genera were classified together (64). 
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Fusobacteria, Tenericutes and Thermotogae (Figure 2 and 
Table 1). These mostly include Gram-positive bacteria 
with low genomic GC content. In particular, an interesting 
situation is observed in Firmicutes. In these bacteria, PolC 
is accompanied by DnaE-type polymerases, which, ac- 
cording to phylogenetic analysis, are split into DnaEl 
and DnaE3. The split essentially coincides with the taxo- 
nomic division as DnaE3 polymerases are found in class 
Bacilli, while DnaEl are found in classes Clostridia and 
Negativicutes. Fusobacteria, represented by only five 
genomes in our set, is the only other phylum in which 
PolC was found in combination with DnaEl sequences. 
DnaE3 always co-occurs with PolC, while DnaE2 poly- 
merases almost exclusively go together with DnaEl. 
DnaE2s appear to be distributed randomly, and even 
related bacterial species may differ by the presence 
(absence) of DnaE2. Although in general C-family poly- 
merases are rarely encoded in plasmids (only 3% in our 
set), of those that are, about two-thirds are DnaE2s. This 
observation suggests that horizontal gene transfer may be 
an important route of the DnaE2 dispersal within bacter- 
ial genomes. However, despite the inferred 'mobility' of 
DnaE2 polymerases, they are extremely rarely found in 
PolC genomes (Table 1). This may be interpreted as the 
conflicting overlap between the function of DnaE2 and 
that of a PolC-DnaE3/DnaEl pair. 

Domain architectures 

To better understand differences between distinct poly- 
merase groups, we compiled their domain architectures 
(the composition and the arrangement of structural-func- 
tional domains). To this end, for every polymerase 
sequence we asked which of the conserved domains 
revealed by DnaE and PolC structures (Figure 1) as well 
as any other domains are present and in which order. The 
domain-mapping results are fairly robust as only in a rela- 
tively small number of cases we were unable to map any 
conserved domains with statistically significant values 
for polypeptide chain regions exceeding 50 residues. 
Moreover, regions without the domain assignment occur 
only at sequence termini. 

The survey (Figure 3 and Supplementary Table S2) 
revealed that each polymerase group has a typical archi- 
tecture. At the same time, we observed a significant vari- 
ability within each group. Most often, new variants are 
associated with the loss of one or more domains, but 
there are also cases of the domain gain. The most 
conserved subset of structural-functional domains 
includes the combination of PHP, Pol3 and (HhH) 2 
domains. For the sequence to be annotated as a putative 
C-family DNA polymerase, we required that it should 
have the intact region harboring the polymerase active 
site. Therefore, the conservation of the polymerase core 
(Pol3) is expected. Somewhat surprisingly, PHP and 
(HhH) 2 domains are also nearly universally present, sug- 
gesting their important structural and/or functional role. 

The most common architecture of PolC (86%) is iden- 
tical to that of B. subtilis PolC. It consists of NTD, OB 
domain, PHP with the inserted exonuclease (Exo) domain, 
Pol3 and the (HhH) 2 motif (Figure 3). For a small number 



of PolC sequences (7%), we failed to assign NTD with 
statistically significant values. However, it has been 
observed that PolC NTDs are poorly conserved at the 
sequence level (18). Thus, it may well be that in most 
such cases a strongly diverged NTD is present in the un- 
assigned N-terminal region. One of the hallmarks of PolC 
is the proofreading exonuclease (Exo) inserted into the 
PHP domain. Thus, it was surprising to find PolC se- 
quences (6%) that lack the Exo insertion. Interestingly, 
all such PolC variants (with the single exception of an 
additional PolC copy in Thermotogae) were found exclu- 
sively in Clostridia and Negativicutes, two classes of 
Firmicutes bacteria. Firmicutes belonging to these two 
classes all have PolC paired with DnaEl and not with 
DnaE3 (Table 1). An example of domain expansion, rep- 
resented by a tandem duplication of OB domains, is 
observed in two PolCs from Lactococcus genus. 

The most typical architecture of the DnaEl group 
(91%) is represented by the E. coli PolIIIoc subunit and 
includes PHP, Pol3, (HhH) 2 , OB and CTD domains. The 
PolIII proofreading exonuclease activity in E. coli is 
supplied by the separate s-subunit complexed with a- 
subunit (1). Surprisingly, we detected a fraction of 
DnaEl subunits, all of them in Bacteroidetes, having the 
exonuclease domain as part of the same polypeptide chain. 
Previously, this has been observed only in PolC type a- 
subunits. However, unlike in PolC, the DnaEl exonucle- 
ase domain is not inserted into the PHP domain but 
attached to its N-terminus through a linker (~60 
residues). This difference indicates that the incorporation 
of the exonuclease domain into PolC and DnaEl are un- 
related evolutionary events. Other variants display differ- 
ences in the region, C-terminal to the (HhH) 2 motif. In 
some of these cases, the CTD could not be identified with 
statistically significant values within the unassigned 
C-terminal region. Since CTD, similarly to NTD of 
PolC, is poorly conserved, at least a number of CTDs 
may have escaped identification. One of the minor 
variants features domain expansion (duplicated OB 
domain) in the otherwise canonical architecture. Another 
variant is truncated right after the OB domain, thus 
excluding any possibility of the CTD presence. Also, a 
second DnaEl copy from S-proteobacterium 
Desulfococcus oleovorans (a typical DnaEl is also 
present) has CTD replaced with a small domain related 
to the restriction alleviation protein Lar (PFAM family: 
PF14354), a predicted rubredoxin-like zinc binding 
domain. 

The dominating architecture in the DnaE3 group is the 
same as in the case of DnaEl, but less typical. Only 60% 
of DnaE3s, compared with 91% of DnaEls, have this 
architecture, and a larger fraction of DnaE3s lacks one 
or more domains at the C-terminus. As many as 16% 
are truncated after the OB domain (do not have CTD), 
and additional 4% lack both OB and CTD domains. 
There is even a variant with the excised OB domain but 
with CTD present. DnaE3s in Tenericutes all lack CTD, 
and only 40% of them have an identifiable OB domain. 
Thermotogae sequences, which show the strongest diver- 
gence within the DnaE3 group, all lack recognizable OB 
and CTD domains. 
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Figure 3. Domain architectures of distinct groups of C-family polymerases. Labels of standard polymerase domains are the same as in Figure 1 . 
Nontypical domains: UDG, uracil DNA glycosylase; Lar, a domain related to the restriction alleviation protein Lar. Numbers indicate the total 
number of sequences having this particular domain architecture in the nonredundant set (1590 sequences). 



At the level of domain architecture, DnaE2 polymerases 
differ from both DnaEl and DnaE3 in that not a single 
DnaE2 has the CTD. Although some have protruding C- 
terminal tails, disorder predictions suggest that they are 
mostly unstructured. Several DnaE2 sequences lack not 
only CTD but also the OB domain. 

Of those DnaEs that were not assigned to one of the 
above groups (DnaEX), most have architectures already 
found in DnaEl-3 groups. In an unusual variant, found as 
an additional polymerase in some strains of Yersinia pestis 



and Salmonella enterica, OB and CTD domains are 
replaced with the archaeal-type uracil DNA glycosylase 
domain (65), a member of the PFAM family PF03167. 

In addition to the PHP-Pol3-(HhH) 2 conserved core, 
the OB domain is also nearly universally present 
(detected in at least 97% of polymerases). However, its 
position in relation to the conserved core is completely 
different: in DnaE, it is C-terminal, while in PolC, it is 
N-terminal. This observation inevitably raises a 
question: do OB domains of both PolC and DnaE 



Nucleic Acids Research, 2014, Vol. 42, No. 3 1401 



tRNA synthetase 
(Asp/Asn/Lys) 
archaea, bacteria, eukaryota) 



DNA Pol II 
small subunit 
(archaea) 

Phosphoesterase^ 
RecJ-like 
(archaea) 




SSB 

(bacteria, viruses) 



DnaE2 y m . 



Bacteroidetes 
DnaE2 




3'-5' Exoribonuclease YhaM 
CMP-binding factor 
(bacteria, archaea) 



Exodeoxyribonuclease VII 
large subunit 
(archaea, bacteria) 



RFA2 
(archaea)C. ; 

STN1. 

(plants) I- - 




RecG helicase 
(bacteria) 



RFA1 
(archaea, eukaryota) 



Figure 4. Homologs of PolIIIa OB domains clustered according to sequence similarity using CLANS (66). Each dot represents a single sequence. 
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P-value of le-09 or better are shown. Abbreviations: SSB, single-stranded DNA binding protein; RFA, replication factor A; STN1, a subunit of the 
single-stranded DNA binding CST complex, involved in telomere maintenance. 



derive from the ancestral C-family polymerase or had they 
been acquired independently by DnaE and PolC lineages? 
If the first scenario were true, DnaE and PolC OB 
domains would be expected to be closest to each other. 
Conversely, the second scenario would imply that DnaE 
and PolC OB domains should be more similar to their 
respective parental OB domain families than to each 
other. To distinguish between the two alternatives, we col- 
lected a large number of diverse homologs of both DnaE 
and PolC OB domains (up to 10 PSI-BLAST iterations, 
le-03 E-value cutoff) and clustered them according to all- 
against-all sequence similarities. Clustering revealed that 
DnaE OB domains are connected to a considerable 
number of OB domain families, while PolC OB domains 
are almost exclusively linked to OB domains of DnaE 
(Figure 4). These results imply that OB domain of PolC 
descended from the ancestral DnaE OB domain and not 
from any other source. In other words, the results suggest 
that the last common ancestor of DnaE and PolC already 
had the OB domain and was of the DnaE type, that is, 
with OB domain C-terminal to the polymerase core. 

Structure and surface conservation 

The four polymerase groups differ considerably according 
to the sequence length (Figure 5A). Not surprisingly, PolC 
polymerases are the longest owing to the inserted exonucle- 
ase domain and a fairly long PolC-specific NTD. DnaEl 
polymerases are typically longer than DnaE2 or DnaE3. 
Despite the large difference in the overall length, a common 
feature of PolC and DnaEl polymerases is that their length 



varies in a fairly narrow range. In contrast, the length of 
both DnaE2 and DnaE3 polymerases shows strong 
heterogeneity. Our survey of domain architectures 
(Figure 3) suggested that the length variability to a large 
degree is determined by the absence/presence of structural 
domains in terminal regions. Therefore, we repeated the 
analysis of sequence length distribution for only the evolu- 
tionary conserved core consisting of PHP, Pol3 and 
(HhH) 2 regions (Figure 5B). Strikingly, the core regions 
of PolC and DnaEl have nearly identical length distribu- 
tions. On the other hand, similarly to full sequences, the 
core regions of DnaE3 and DnaE2 are both shorter and 
considerably more heterogeneous. The heterogeneity of 
DnaE3s is mainly due to structural differences in PHP 
and 'thumb' domains. If taxonomy is considered, by far, 
the most distinct DnaE3 polymerases are present in 
Thermotogae . Their PHP domain is significantly smaller, 
as it lacks several structural elements (Supplementary 
Figure S1A). Similar, but less severe reduction of the 
PHP domain can be seen in DnaE3s of most Tenericutes 
and at least some Firmicutes. In addition to the degraded 
PHP domain, Thermotogae DnaE3 polymerases also 
feature a strongly reduced 'thumb' (Supplementary 
Figure SIB). Their 'thumb' lacks a helix-loop-helix motif, 
corresponding to T. aquaticus DnaEl residues 513-552, 
that provides additional contacts with the DNA duplex 
(11) and may reach the downstream template DNA. 
Intriguingly, a similarly reduced 'thumb' that has been 
observed in the crystal structure of G. kaustophilus PolC 
(16) appears to be typical for PolC polymerases. The 
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Figure 5. Polymerase sequence length distributions in four distinct 
groups. Distributions for (A) full-length sequences and (B) only the 
core region [PHP, Pol3 and (HhH) 2 domains; residues 6-889 of 
E. coli DnaEl]. The vertical axis indicates the fraction of polymerases 
in a given length interval (at the step of 10 residues) for each group 
separately. DnaE3 of Thermotogae and DnaE2 of Bacteroidetes 
have distinctly shorter core sequence regions than other polymerases 
in corresponding groups. 



DnaE2 core, similarly to DnaE3, on average is shorter than 
that of PolC or DnaEl, but the reduced PHP and 'thumb' 
domains are mainly confined to DnaE2s of Bacteroidetes. 

Protein surface is often as informative as the structure. 
Differences in patterns of surface conservation may 
indicate the relative importance of functional sites. It 
can be seen in Figure 6 that the active site and some of 
the DNA binding surfaces are highly conserved in all four 
polymerase groups. Strikingly, DnaE3 and even nonessen- 
tial DnaE2 polymerases show the conservation as strong 
as the DnaEl and PolC groups, representing main repli- 
cative polymerases. However, the P-clamp binding site is 
more strongly conserved in DnaEl and PolC groups. 

It has been known for some time that the PHP domain 
of E. coli PolIIIa (DnaEl) harbors the binding site of the 
proofreading s-subunit (67). Recently, Ozawa et al. 
mapped this binding site by solving the crystal structure 
of the C-terminal part of e-subunit fused to the PHP 
domain through a flexible linker (68). Our analysis of 
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Figure 6. Surface residue conservation in different polymerase groups. 
The structures shown are T. aquaticus DnaEl (PDB ID: 3E0D), 
G. kaustophilus PolC (PDB ID: 3F2B), M. tuberculosis DnaE2, 
B. subtilis DnaE3. All structures are shown in same orientation. OB 
domain and CTD were removed for clarity. Relative positions of all 
domains are indicated for DnaEl. The surfaces are colored according 
to ConSurf results: variable — cyan, conserved — maroon. 



the PHP surface conservation shows that DnaEl polymer- 
ases have a conserved patch in the exact position of the 
binding site of the s-subunit (Figure 7). Conserved 
residues forming putative contacts with the s-subunit 
can be easily identified along the whole length of contact- 
ing surface patch (Figure 7). Residues, maintaining the a-s 
interaction in E. coli have been outlined according to the 
chimeric structure (68). At least two residues of the E. coli 
s-subunit C-terminal segment (His225 and Trp241) were 
experimentally shown to be important for maintaining the 
interaction with a-subunit (69,70). sHis225 forms a 
hydrogen bond to Lys63 of the a-subunit, while sTrp241 
is embedded in a conserved hydrophobic pocket. Both 
Lys63 (with the adjacent proline) and the sTrp-binding 
pocket are highly conserved in the DnaEl group, but 
only moderately in DnaE3 polymerases. The rest of the 
putative s-subunit binding surface patch in DnaE3 is even 
less conserved. Also, in structurally distinct Thermotogae 
DnaE3 polymerases, the missing structural elements in the 
PHP domain constitute a large part of the putative s- 
subunit binding surface (Supplementary Figure SIC). 
The corresponding surface region in DnaE2 polymerases 
is not conserved at all, suggesting that they do not bind 
s-subunit or at least not in the same way as E. coli DnaEl. 

Electrostatic properties 

Owing to different functional roles, different polymerase 
groups might be expected to have distinct electrostatic 
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Figure 7. Surface conservation of the PHP domain at the putative s- 
binding site in the three DnaE groups. Top left panel shows the full 
structure of T. aquaticus DnaEl, colored according to the domain or- 
ganization. The position of the CTD of s-subunit (blue ribbon) corres- 
ponds to that obtained for the E. coli DNA polymerase IIIoc-s chimera 
(PDB ID: 4GX9). Remaining panels show surface conservation of only 
the PHP domain for each of the three DnaE groups. PHP domains 
with the overlaid C-terminal segment of s-subunit (blue ribbon) are 
shown in the same orientation as in the top left panel. Most conserved 
residues contacting the s-subunit are indicated with red arrows. The 
pocket where Trp241 of the s-subunit is bound is also indicated. 



properties. To make a proper comparison between poly- 
merase groups, similarly to the length analysis, we only 
used the common conserved core [PHP, Pol3 and (HhH) 2 ]. 
As a simple initial test, we computed theoretical isoelectric 
points (pi) for each protein. Despite simplicity of the 
approach, the pi calculation results revealed striking 
parallels with the results of sequence length analysis. 
Both the average pi values and the pi distribution for 
DnaEl and PolC are almost identical (Figure 8). DnaE2 
and DnaE3 groups both have on average higher pi values 
than DnaEl/PolC. Furthermore, in contrast to DnaEl/ 
PolC, pi values for DnaE2 and DnaE3 display broad dis- 
tribution. A more detailed analysis in different subgroups 
(Supplementary Figure S2) revealed that extreme pi values 
tend to coincide with major structural deviations. For 
example, DnaE3 polymerases of Thermotogae and 
Tenericutes featuring significant loss of structural 
elements have the highest pi values (averages of 7.9 and 
7.6, respectively). In Figure 8, it can be seen that high pi 
values are also observed for a few DnaEl and PolC poly- 
merases. Interestingly, most of these 'unusual' DnaEl s 
and PolCs are found in bacteria with tiny (< 1 Mb) AT- 
rich genomes (Supplementary Figure S3). DnaEl poly- 
merases with high pi values are almost exclusively found 
in insect symbionts, while PolCs are mostly found in 
mycoplasmas and phytoplasmas. In contrast, high pi 
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Figure 8. Predicted polymerase isoelectric point (pi) value distributions 
in four distinct groups, pi values were calculated for the core region 
[PHP, Pol3 and (HhH) 2 domains; residues 6-889 of E. coli DnaEl]. The 
vertical axis indicates the fraction of polymerases at a given pi value 
(with a step of 0.1) for each group separately. 



values in DnaE3 and DnaE2 groups are not specifically 
associated with small genome size or low GC content. 

The charge distribution on the surface of polymerases is 
unbalanced (Supplementary Figure S4). Typical DnaEl 
and PolC polymerases with low pi (pi < 6) have positive 
charge patches predominantly in the DNA binding 
groove. The increase in pi values appears to be associated 
both with the increase in positive charge within the DNA 
binding groove and the nonspecific dispersal of positive 
charges. Comparison of several DnaEl/DnaE2 and 
PolC/DnaE3 pairs originating from the same species illus- 
trates a more strongly pronounced positive charge of the 
DNA binding groove in DnaE2 and DnaE3 compared 
with the corresponding DnaEl and PolC polymerases 
(Supplementary Figure S4). These observations suggest 
that typical DnaEl and PolC polymerases bind the 
DNA less strongly than DnaE2 or DnaE3. 

Functional motifs 

Sequence motifs may define important functional charac- 
teristics. Therefore, we analyzed how known functional 
motifs differ in distinct polymerase groups and asked 
whether there are some novel motifs. 

Polymerase active site and its neighborhood 

The polymerase active site aspartic residues [E. coli D401, 
D403 and D555 (7,71)] that coordinate catalytic magne- 
sium ions are absolutely conserved in the entire C-family 
(Supplementary Figure S5). If we consider the immediate 
vicinity, PolC clearly differs from DnaE groups, which 
all display a similar conservation pattern. For example, 
additional absolutely conserved Asp (E. coli, D405) is 
present in all three DnaE groups, but is replaced by Asn 
(B. subtilis, N970) in PolC. DnaE2 appears to have slightly 
distinct active site neighborhood in comparison with 
either DnaEl or DnaE3. However, the polymerase 
active site is highly conserved in all groups and it is 
unclear to what degree differences in the neighborhood 
are significant. 
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PHP metal binding site 

In contrast to the highly conserved polymerase active site 
region, the PHP domain turned out to be much more 
informative. It has been associated with a novel Zn 2+ - 
dependent proofreading exonuclease activity in at least 
some C-family polymerases (11,72). We analyzed the 
conservation of nine PHP positions (Figure 9A and B) 
that in G. kaustophilus PolC and T. aquaticus DnaEl 
crystal structures are involved in metal binding 
(11,16,73). Results show (Figure 9C) that these positions 
display distinct levels of conservation in different groups. 
PolC has all nine positions strongly conserved. Similar, 
albeit less strong, conservation is observed in the DnaEl 
group. In contrast, the corresponding positions in 
DnaE2 and even more so in DnaE3, display little if any 
conservation. 

Since polymerase groups, in particular DnaEl, span a 
variety of bacterial lineages, we also looked at the PHP 
metal binding site in more detail, taking into account 
bacterial taxonomy (Supplementary Figure S6). It turned 
out that DnaEl sequences of three major classes of 
Proteobacteria (a, p and y) all have extensive substitutions 
in the PHP active site. Notably, not a single DnaEl in 
these bacteria possesses all nine (or even eight) residues 
from the consensus pattern. This finding is consistent 
with previous studies suggesting that at least some of 
Proteobacteria might have an inactivated PHP 



exonuclease active site (74,75). DnaEl polymerases in 
Bacteroidetes and Fusobacteria also display a significant 
variability in the PHP metal binding positions. In 
contrast, DnaEl polymerases in Firmicutes and all of the 
remaining bacteria, including S- and z-proteobacteria, 
display extremely high conservation of the PHP metal 
binding site. Although as a group, DnaE2 polymerases 
do not have the conserved canonical pattern, a small 
fraction of DnaE2s have the PHP active site mostly 
intact. These polymerases belonging to a subgroup of 
Actinobacteria (including M. tuberculosis) appear to 
form a distinct clade in the phylogenetic analysis. On the 
other hand, none of the DnaE3 subgroups showed any 
conservation, suggesting that the defective PHP metal 
binding site is a hallmark of DnaE3 polymerases. 

Potentially, DnaE polymerases could have two different 
types of exonuclease activity, one due to the bound s- 
subunit and the second due to the exonuclease activity 
of the PHP domain. An interesting question is whether 
the two exonuclease activities are mutually exclusive? 
Recently, it was suggested that they likely might be (75). 
In such case, it might be expected that DnaEl polymerases 
with the nonfunctional PHP metal binding site would 
maintain a conserved s-binding site, while those with the 
functional PHP active site would not. We find that 
Proteobacteria DnaEl polymerases, featuring extensive 
substitutions in the PHP active site (Supplementary 
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Figure 9. Conservation of metal binding residues in the PHP domain. (A) Schematic representation of the secondary structure for the PHP region 
(1-290) of T. aquaticus DnaEl. The positions of nine metal-coordinating residues are indicated with red arrows. (B) The structures of the PHP metal 
binding site in T. aquaticus DnaEl (PDB ID: 4IQJ, left) and G. kaustophilus PolC (PDB ID: 3F2D, right) with residue names and numbers labeled. 
Corresponding residues in E. coli DnaEl and B. subtilis PolC are indicated in parenthesis. The same superimposed structures are shown in the 
middle. (C) Sequence logo representation of the conservation of corresponding positions in different polymerase groups. Positions are labeled 
according to a representative from each group: E. coli DnaEl, M. tuberculosis DnaE2, B. subtilis DnaE3 and PolC. 
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Figure S6), indeed have a strongly conserved putative s- 
binding site (Supplementary Figure S7). Nonetheless, 
DnaEls of other bacteria with the perfectly conserved 
PHP active site (Supplementary Figure S6) still retain con- 
siderable surface conservation (Supplementary Figure S7). 
In other words, the presence of the intact PHP metal 
binding site does not seem to preclude the s-subunit 
binding. Apparently, the differences between DnaE 
groups are more relevant. For example, DnaE-type poly- 
merases in Firmicutes, partitioned into DnaEl and DnaE3 
groups (Figure 2), show the opposite trends 
(Supplementary Figure S7). These DnaEl polymerases 
have both the intact PHP metal binding site and a 
strongly conserved putative s-binding surface patch. In 
contrast, DnaE3s have both the disrupted metal binding 
site and almost no conservation of the PHP surface. 

DNA sliding clamp binding motif 

Bacterial replicative DNA polymerases interact with the 
DNA sliding clamp to achieve high speed and 
processivity. The interaction is mediated by a short 
sequence motif within PolIIIa subunit. The consensus 
P-clamp binding motif has been identified previously as 
the pentapeptide QL[S/D]LF (76) in which positions 1, 4 
and 5 appear to be the most important (77,78). We found 
that the presence and nature of the P-clamp binding motif 
vary considerably between different polymerase groups 
(Figure 10). The PolC group almost universally (98% of 
sequences) features the consensus motif QLSLF that is 
closest to the previously established consensus P-clamp 
binding motif. The P-clamp binding motif, also present 
in nearly all (99%) DnaEl sequences, is less conserved 
than in PolC, yet is still close to the consensus. In particu- 
lar, the first (Q) and the last two positions (LF) are well 
conserved. In contrast, we identified a putative motif in 
only 75% of DnaE3 polymerases. Furthermore, only the 
positions 4 and 5 of the motif are conserved. DnaE2 
sequences are also heterogeneous in respect to the 
presence of the p-clamp binding motif. About 85% of 
DnaE2s have an identifiable motif, but it is visibly differ- 
ent from the canonical one. More specifically, positions 1 
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Figure 10. The P-clamp binding motif in different polymerase 
groups. Residues are labeled according to the same representatives as 
in Figure 9. 



and 3 are dominated by proline, which is not present in 
corresponding positions in other polymerase groups. 

Like in the case of PHP metal binding site, a more 
detailed look at taxonomy-based p-clamp binding motifs 
revealed significant variation (Supplementary Figure S8). 
For example, DnaEl polymerases in Deferribacteres 
(five sequences) and a small fraction of Bacteroidetes 
(five sequences) do not have a recognizable P-clamp 
binding motif at all. In these bacteria, DnaEl appears 
to be the primary replicative polymerase, raising a 
question of how the replication processivity is achieved 
in those cases. Notably, DnaEl sequences that co-occur 
with PolC in Clostridia (Firmicutes) have a motif typ- 
ical to other DnaEl sequences and not to the 'weak' 
motif of DnaE3 (also co-occurring with PolC) 
(Supplementary Figure S8). A small fraction of DnaE3 
sequences (14%), most of them in bacteria of the order 
Bacillales, also have motifs similar to those of DnaEl 
or PolC (QxxLF, where x is any residue). B. subtilis is 
one of these bacteria and its DnaE3 has a relatively 
'strong' motif (QMGLF). 

Several different putative p-clamp binding motifs can be 
identified in DnaE2 polymerases. The distinct motif 
(PLPLF) is predominantly found in a-proteobacteria. 
A similar consensus motif (xLPLF) is typical for 
y-proteobacteria, but /3-proteobacteria, although taxonom- 
ically close to y-proteobacteria, seem to have a different 
and even 'weaker' consensus motif, xxxLL. Some 
Actinobacteria seem to have a nontypical QLPLx motif, 
which in almost half of the cases can be extended to the 
hexameric QLPLxL motif, similar to the consensus motif 
of Hda, DnaA-related protein (77,79). 

Putative protein-protein interaction motif in DnaE2 

In addition to the known functional sites, we also searched 
for any other conserved motifs that stand out in at least 
one of the groups. One such motif is associated with 
~77% of DnaE2 sequences. This motif, noticed previously 
(35), features the SRDF[H/R] consensus sequence at the 
very C-terminus (Supplementary Figure S9A). The motif 
is part of the region, predicted to be intrinsically unstruc- 
tured (Supplementary Figure S9B). High conservation 
combined with the lack of a defined structure typically 
is the signature of a protein-protein interaction motif, 
but in this case its specific function remains to be estab- 
lished. This motif is found in the majority of DnaE2s 
of Proteobacteria (except most of S-proteobacteria), 
part of Actinobacteria (including M. tuberculosis), a few 
Bacteroidetes and some minor phyla. Interestingly, all of 
these DnaE2 sequences clustered together in the phylogen- 
etic tree (bootstrap value 97%), despite the omission of 
the motif during the tree construction. 

Polymerase combinations and global characteristics of 
the bacterial cell 

Our analyses presented above indicate that there are sig- 
nificant differences between distinct groups of C-family 
polymerases. In addition, both the number and the type 
of polymerases encoded in genomes vary considerably. 
It might be expected that different polymerase combin- 
ations may represent different functional capabilities 
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Figure 11. The relationships between different polymerase combinations and genome size (A and B) or genomic GC content (C and D). Data for all 
genomes (A and C) and genomes that do not encode PolC (B and D) are depicted separately. The remaining nontypical polymerase combinations are 
incorporated into those shown. The relationships between genome size and genomic GC content (where one component is binned and the other 
component is the calculated average for each bin) are depicted as secondary graphs (gray). 'E' indicates DnaE, 'PC, PolC. 



pertaining to DNA replication such as speed and mutation 
bias and/or dealing with oxidative stress exerted onto 
DNA. Therefore, we asked whether polymerase combin- 
ations correlate with global characteristics of bacterial 
species such as the genome size, genome GC content and 
the use of oxygen. 

First, we looked at the genome size. We divided 
genomes into several bins according to their size and 
asked which fraction of genomes in each bin has a par- 
ticular polymerase combination. The results revealed 
several clear trends (Figure 11A and B). Except for the 
smallest genomes, the fraction of bacteria carrying sole 
DnaEl decreases as genomes become larger. At the 
same time, the fraction of genomes that, in addition to 
DnaEl, encode one or more DnaE2 increases dramatically 
(Figure 11 A). A steady decrease with the increase in 
genome size is also observed for the fraction of genomes 
represented by the PolC and DnaE3 combination. 
However, PolC-carrying bacteria are restricted to only a 
few phyla. Therefore, we also looked at only those 
genomes that do not have PolC and are distributed 
throughout the bacteria kingdom. The picture does not 
change significantly, yet the opposite trends corresponding 



to DnaEl alone and DnaEl accompanied by one or more 
DnaE2 become clearer (Figure 1 IB). 

Next, we looked at the GC content of bacterial genomes. 
It varies from 14% for Candidatus Zinderia insecticola, 
a member of fi-proteobacteria (80) to 75% for acti- 
nobacterium Cellulomonas fimi (81). As in the case of 
genome size, we grouped genomes into several bins accord- 
ing to the GC content and looked at the spectrum of 
polymerase sets represented in each bin. Again, similarly 
to the genome size analysis, there is a clear trend 
(Figure 11C). The fraction of genomes with single DnaEl 
is about the same up to ~60% GC. For GC-rich genomes 
(>60% GC) it decreases, while the fraction of those coding 
for DnaE2 in addition to DnaEl increases dramatically. 
Combinations that include PolC and either DnaE3 or 
DnaEl essentially disappear from genomes with the GC 
content >60%. If we consider only non-PolC genomes 
(Figure 11D), the picture becomes similar to that of 
dependency on the genome size (Figure 1 IB). The only 
difference is that the increasing number of DnaE2 polymer- 
ases is associated with the increasingly larger genome size, 
but the increase of GC seems to coincide only with the 
presence and not the number of DnaE2s. 
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Figure 12. The relationship between different polymerase combinations 
and the use of oxygen by corresponding bacteria. Bacterial species 
(genomes) are divided into three categories: (i) oxygen using, including 
aerobes, facultative aerobes and microaerophiles (blue), (ii) anaerobes 
(yellow) and (iii) those for which data regarding the use of oxygen is 
unavailable (gray). 



The use of oxygen in bacterial metabolism is associated 
with the oxidative damage to the DNA. We therefore 
asked whether the aerobiosis has any correlation with 
the polymerase combinations in genomes. We divided 
bacteria into two broad groups. The first group, oxygen- 
using bacteria, included aerobes, facultative aerobes/ 
anaerobes (preferentially use oxygen if available) and 
microaerophiles (require oxygen, but only at low levels), 
while the second group consisted of anaerobes. Genomes 
with DnaEl as a sole C-family polymerase are common 
for both groups (Figure 12). However, bacteria that have 
DnaEl and any number of DnaE2s are almost exclusively 
oxygen using. Interestingly, there is a sharp contrast 
between the two groups of PolC-carrying bacteria. 
Those that have PolC + DnaEl nearly all are anaerobes, 
while those having PolC + DnaE3 are predominantly 
oxygen-using bacteria. 

Taken together, these results show that specific poly- 
merase sets encoded in a given genome strongly correlate 
with genome size, GC content and oxygen requirement. In 
particular, it appears that the presence of DnaE2 together 
with DnaEl is linked to bacteria featuring large GC-rich 
genomes and living in aerobic environments. However, 
this does not necessarily imply the causal relationship as 
genome size and GC content are also correlated to each 
other (secondary axes in Figure 11). We therefore sought 
any other data that would either support or contradict the 
idea that the presence of DnaE2 might influence genome 
properties (size, GC content or both). To this end, we 
analyzed whether there is any correlation between electro- 
static (pi) and structural (length) properties of DnaE2 
polymerases and either genome size or GC content 
(Supplementary Figures S3 and S10). The strongest cor- 
relation (Spearman's rank correlation coefficient p = 0.58) 
was observed between the length of the DnaE2 polymer- 
ase core (PHP, Pol3 and (HhH) 2 ) and GC content of the 
genome, while no correlation was found with the genome 
size (Supplementary Figure S10). Moreover, no significant 
correlation with either GC content or genome size was 



observed for DnaEl, DnaE3 or PolC polymerases. The 
variation in sequence length of DnaE2 polymerases is 
mainly due to additional structural elements or deletions 
in the PHP and 'thumb' domains. Functional importance 
of these differences is not obvious, but the correlation with 
GC content provides another hint for the possible involve- 
ment of DnaE2 in shaping genomic GC content. 

DISCUSSION 

Our results based on a representative set derived from 
almost two thousand genomes showed that all bacteria 
have at least one C-family DNA polymerase. This 
strongly suggests that members of this family are principal 
genome replication enzymes throughout the bacterial 
world. C-family polymerases come in two major forms, 
DnaE and PolC, inferred to have evolved by ancient du- 
plication (17). Since DnaE and PolC differ in the exact 
domain composition and arrangement, an interesting 
question is the nature of the ancestral form and the 
pathway that led to two extant forms. Our results 
combined with the available DnaE (7,8) and PolC (16) 
crystal structures indicate that both forms have a 
common universally conserved region that includes PHP, 
polymerase core (Pol3) and (HhH) 2 domains. One other 
domain present in almost all members of C-family is the 
OB domain, which, however, in DnaE and PolC is 
attached to the opposite ends of the universally conserved 
region. Our results suggest that the OB domain of PolC 
derives from the DnaE-type OB domain (Figure 4). The 
simplest scenario, consistent with the common descent and 
the opposite location of OB domains in DnaE and PolC, 
involves the duplication with circular permutation of an 
ancestral DnaE-type polymerase (Figure 13). This 
scenario further implies that additional terminal regions 
have been independently acquired in PolC and DnaE 
lineages after the duplication. Again this is consistent 
with the observation that terminal regions are unique in 
PolC and DnaEl/DnaE3 (18). The analysis of coevolution 
patterns of DnaE and PolC polymerases (82) supports the 
ancient DnaE hypothesis. It was shown that PolC 
coevolved with some genes of the RNA degradation 
pathway, found exclusively in PolC-containing bacteria, 
while DnaE coevolved with proteins found throughout 
all bacterial phyla (82). After the emergence, PolC has 
apparently evolved as a highly specialized DNA polymer- 
ase that has to be complemented with a DnaE-type poly- 
merase to form a fully functional replicase. In contrast, 
DnaE has evolved into different groups having different 
functional capabilities. DnaEl is the most versatile as it is 
the only type of C-family polymerases that can replicate 
genome by itself (as in E. coli). DnaE3 is never found 
alone in any genome (always with PolC), indicating that 
it is specialized to complement the PolC function. DnaE2 
is also never found alone consistent with its role of a 
nonessential polymerase, not involved in bulk DNA 
replication (26-32). 

DnaE2 provides a clearest link between the domain 
architecture and functional specialization. None of 
DnaE2 polymerases have the CTD found in both 
DnaEl and DnaE3 groups. Using mutational studies, 
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genetic screens (2,78) and X-ray crystallography (73), it 
has been shown that CTD is important for binding Pol 
III x-subunit. Pol III x-subunit is part of the clamp loader 
complex and also a central coordinator of the replisome, 
as it interacts with both the replicative helicase and the a- 
subunit of Pol III. Therefore, the absence of the x-inter- 
acting domain (CTD) implies that DnaE2 polymerases do 
not interact with x-subunit (at least not in the same way as 
DnaEl). Interestingly, the domain composition of DnaE2 
coincides with that of the presumed ancestral form of the 
C-family polymerase (Figure 13). It may be that DnaE2 is 
a direct descendant; an alternative possibility is that it 
had evolved from DnaEl /DnaE3 by subsequent loss of 
the CTD. 

DnaEl and DnaE3 groups are difficult to resolve by 
phylogenetic analysis, and the typical domain organization 
is the same for both. Therefore, it is puzzling that the func- 
tional versatility of these two groups is significantly differ- 
ent: DnaEl can function by itself, while DnaE3 has to be 
always accompanied by PolC. A closer comparison of 
domain architectures reveals a significant heterogeneity 
within the C-terminal part of DnaE3 compared with 
DnaEl polymerases. A significant fraction of DnaE3s do 
not have CTD or even the OB domain. One of the possible 
explanations of the observed heterogeneity is that OB and 
CTD domains are not critically important for the DnaE3 
function. At least for CTD there is experimental evidence 
that its role is different in E. coli DnaEl and B. subtilis 
DnaE3. Both proteins have identical domain architectures. 
In E. coli, CTD is responsible for the formation of a stable 
complex between DnaEl and the clamp loader x-subunit 
(78,83). This interaction retains DnaEl within the 
replisome. In contrast, in vivo experiments in B. subtilis 
showed that the retention of DnaE3 at the active replica- 
tion fork is entirely dependent on the interaction with the 
C-terminal tail of the single-stranded DNA binding protein 
(SSB) (84). These data indicate that B. subtilis DnaE3 
either does not bind x-subunit at all, or that this interaction 
is too weak in the absence of SSB. Unlike DnaE3, the 
retention of B. subtilis PolC at the replication fork does 
not depend on SSB (84), supporting the proposed role 
of its NTD for interaction with the clamp loader (18). 



Conservation of sequence or structure motifs provides 
additional hints regarding functional differences. In this 
regard, the PHP domain proved to harbor the most distin- 
guishing features of the four polymerase groups. One of 
the known functions of the PHP domain is the binding 
of the 8-subunit, which is the primary proofreader in the 
replisome of Gram-negative bacteria (1). Recently, the 
structure of a chimera composed of the PHP domain of 
E. coli DnaEl and the e-subunit has been determined, 
revealing the exact interaction site between these two 
subunits (68). This enabled us to ask how well the corres- 
ponding putative E-subunit binding site is conserved 
among different polymerase groups. DnaEl polymerases 
show high conservation all over the putative interaction 
site. On the other hand, DnaE3 and even more so DnaE2 
polymerases show little conservation (Figure 7 and 
Supplementary Figure S7), implying that these polymer- 
ases most probably do not bind an s-subunit, at least not 
in a similar way as E. coli DnaEl. However, separate (or 
integral, in the case of PolC) exonuclease might be not the 
only means to harbor exonuclease activity by C-family 
polymerases. At least in some polymerase a-subunits, the 
PHP domain has been found to be associated with a Zn 2+ - 
dependent proofreading activity (11,72). But are these just 
some unique cases or might the PHP-dependent proofread- 
ing activity be more widespread? PHP domains of PolC 
and DnaEl (with some exceptions including three major 
classes of Proteobacteria) show highly conserved metal 
binding site (Figure 9 and Supplementary Figure S6). 
In contrast, DnaE3 polymerases show no conservation of 
corresponding positions whatsoever. Most of DnaE2 poly- 
merases except a subgroup of Actinobacteria also lack the 
intact metal binding site. Therefore, our results suggest 
that DnaE3 and most of DnaE2 polymerases are devoid 
of proofreading activity. This is the actual case for 
B. subtilis DnaE3, for which no exonuclease activity 
could be detected (85). Perhaps the inferred lack of any 
exonuclease activity might also be linked to the inherently 
low fidelity observed for some DnaE3 polymerases (22,85). 

Another difference between polymerase groups emerges 
if we consider the P-clamp binding motif. This motif 
mediates the interaction between a DNA polymerase 
and the DNA sliding clamp, thereby dramatically 
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increasing processivity of the polymerase. The p-clamp 
binding motif in PolC polymerases is close to the ideal 
consensus, which is also one of the most potent P-clamp 
binding variants (76,78). The consensus motif in DnaEl 
sequences shows higher variability, but is still reasonably 
conserved (except the second position). In contrast, in the 
DnaE3 group, only two last hydrophobic positions are 
conserved. Moreover, only about three quarters of 
DnaE3 sequences have this 'weak' P-clamp binding 
motif. Notably, some DnaE3 polymerases such as the 
one in B. subtilis do contain relatively 'strong' P-clamp 
binding motifs. However, considering the overall low con- 
servation of P-clamp binding motifs in other DnaE3 
polymerases, the B. subtilis case seems to be more of an 
exception rather than a rule. Although most DnaE2s do 
have the P-clamp binding motif, it is noncanonical with 
only last two positions showing a conservation pattern 
typical for other C-family polymerases. The observed 
low conservation of p-clamp binding motif in DnaE3 
and DnaE2 groups suggests that there is little evolutionary 
pressure to retain a strong interaction with p-clamp. 

Some interesting observations can be made from our 
survey of structural and electrostatic properties. Despite 
differences in domain architecture and functional versatil- 
ity, PolC and DnaEl share some remarkably similar 
features. Both polymerase groups have nearly identical 
length and narrow variability of the evolutionary core. 
Members of both groups also have a generally negative 
surface charge with positive charge being located mainly 
in the DNA binding groove. Similarly to the length of the 
core, the charge variance within these two replicative poly- 
merase groups is rather low. Although the DnaE3 group 
also represents essential replicative polymerases, a large 
heterogeneity in both polymerase core length and 
surface charge distribution makes it more similar to 
nonessential DnaE2 polymerases. Moreover, the surfaces 
of the latter two groups are generally more positively 
charged than those of DnaEl or PolC, suggesting a 
stronger DNA binding. Perhaps the elevated positive 
charge, in addition to the lack of exonuclease activity, 
might also contribute to the observed mutagenic character 
of DnaE2 and DnaE3 polymerases (22,26,34,85). 

Genomic distribution of specific sets of polymerases 
revealed a picture, similar to that obtained with a 
smaller number of complete genomes (36). DnaEl, either 
as a single polymerase or in combination with one or more 
DnaE2s, is present in over three quarters of bacterial 
genomes that represent a variety of bacterial phyla. 
Another significant presence is made by PolC 
accompanied by either DnaE3 or DnaEl. However, this 
particular combination is typical for a narrow phyletic 
group, which mostly consists of low-GC Gram-positive 
bacteria. Taking this narrow distribution into account, it 
may not be so surprising that DnaE3s are absent from 
DnaEl -containing genomes. Perhaps more surprising is 
the observation that despite their wide dispersal, DnaE2s 
are infrequently found in PolC-containing genomes. 
DnaE2s are typically found as part of SOS-inducible 
mutagenic cassette identified in many bacterial genomes 
(27,34). At least some DnaE3s are also error-prone 
(22,85) and SOS-inducible (85). A possible explanation 
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of the DnaE2 avoidance in PolC + DnaE3 genomes is that 
it might be disadvantageous for bacterium to have two 
related SOS-inducible polymerases. 

Summing up all the observations about different poly- 
merase groups, three distinct replication systems corres- 
ponding to DnaEl, PolC + DnaE3 and PolC + DnaEl 
polymerase sets could be outlined (Table 2). The first two 
systems are represented by E. coli and B. subtilis, respect- 
ively. In E. coli, the DnaEl polymerase performs chromo- 
somal DNA synthesis all by itself (19). In B. subtilis, PolC 
does the bulk DNA synthesis of both DNA strands, but 
needs help from DnaE3 to extend RNA primers (24). The 
PolC + DnaEl variant is intriguing, as it contains seem- 
ingly two highly efficient polymerases capable of bulk 
DNA synthesis. Unfortunately, there appears to be no ex- 
perimentally characterized PolC + DnaEl systems, despite 
their presence in prominent pathogens like Clostridium 
difficile or Clostridium botulinum. DnaEl polymerases 
co-occurring with PolC have a typical P-clamp binding 
motif, a conserved putative s-subunit binding surface and 
a perfect PHP metal binding site, all the properties of 
DnaEl that can replicate the entire genome by itself. 
Could it be that normally DNA is replicated by DnaEl 
and that PolC is involved only in specific situations? 
Such a view is at least partially supported by the observa- 
tion that, although the majority of Clostridia (and 
Negativicutes) have both PolC and DnaEl polymerases, 
some have only a single DnaEl. Furthermore, all 
PolC lacking the integral exonuclease are only found in 
clostridial genomes. Alternatively, it may be that PolC 
and DnaEl work together at the replication fork and 
that the replication process is even more complex. 
Experimental approaches are clearly needed to gain know- 
ledge on how the PolC + DnaEl replication system really 
functions. 

Based on our proposed evolutionary schema (Figure 13), 
it is tempting to speculate that the PolC + DnaEl replica- 
tion system might be most similar to an ancestral two- 
polymerase system. Consequently, Firmicutes Clostridia 
would be placed at the root of bacterial evolution. This is 
supported by the observation that two different replication 
systems are found in Clostridia: PolC + DnaEl and a single 
DnaEl. Deep branching of Firmicutes [and in particular 
class Clostridia (86)] hypothesis is not novel and has been 
proposed earlier (86,87). The analysis of both replication 
initiation in Gram-positive bacteria and coevolution 
patterns of DnaE and PolC polymerases also suggest 
Firmicutes to be the most ancient bacteria (82,88). If this 
is the case, the evolution of C-family polymerase systems 
might have taken two different paths: toward the single 
DnaEl (by losing PolC) and toward the PolC + DnaE3 
system (DnaEl evolving into DnaE3 due to the relaxed 
selection pressure on several of its functions). 

There have been previous reports suggesting the 
involvement of PolIIIoc subunits in shaping global 
properties of the genome such as the GC content 
(89,90). To further address this issue, we investigated the 
relationship between the polymerase combinations and 
well-defined global cell properties, namely genome size, 
GC content and the use of oxygen. For all of these 
properties, we identified clear trends. However, since 



PolC + DnaE3/DnaEl combinations are found in a phylo- 
genetically narrow group of bacteria, we mostly focused 
on DnaEl /DnaEl + DnaE2 genomes that are not confined 
to specific bacterial phyla. Nonetheless, it is worth 
pointing out that PolC carrying bacteria are sharply 
divided according to the use of oxygen. PolC + DnaEl 
combinations almost exclusively are anaerobes, whereas 
PolC + DnaE3 are typical for oxygen-using bacteria. 
Whether this division is based on functional differences 
of DnaEl and DnaE3 or is just a consequence of a 
narrow phylogenetic distribution remains an open 
question. In DnaE 1 /DnaE 1 + DnaE2 genomes, we found 
a clear dependence between the presence of DnaE2 and 
the increase of both the genome size and the GC content. 
Oxygen-using bacteria also tend to have DnaE2 more 
often. In other words, DnaE2 seems to be typical of 
oxygen-using bacteria having large GC-rich genomes. 
Genome size and the genomic GC content are the result 
of combination of various endogenous cell processes, 
environmental factors and selection pressure. Therefore, 
the relationship that we observed might simply be coinci- 
dental with some other important factors. In particular, 
it is well-known that there is a correlation between the 
genome size and the GC content so that large genomes 
are typically GC-rich (91-94). However, mutational 
spectra of replicative or repair DNA polymerases may 
directly contribute at least to the variation of the GC 
content. Although by now it is generally accepted that 
the overall mutation bias in bacteria is toward the lower 
GC content (95-97), there is a possibility that DnaE2 
might contribute in offsetting or even reversing this bias. 
Supporting this idea is our observation that the length of 
DnaE2 polymerases is correlated with the GC content but 
not the genome size and that none of the groups of essen- 
tial replicative polymerases shows similar correlation. 

SUPPLEMENTARY DATA 
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